Abstract

A statistically significant research finding should not
be defined as a P-value of 0.05 or less, because this
definition does not take into account study power.
Statistical significance was originally defined by Fisher
RA as a P-value of 0.05 or less. According to Fisher, any
finding that is likely to occur by random variation no more
than 1 in 20 times is considered significant. Neyman J and
Pearson ES subsequently argued that Fisher's definition
was incomplete. They proposed that statistical significance
could only be determined by analyzing the chance of
incorrectly concluding that a study finding is significant (a
Type I error) or incorrectly concluding that a study finding
is insignificant (a Type II error). Their definition of
statistical significance is also incomplete because the error
rates are considered separately, not together. A better
definition of statistical significance is the positive predictive
value of a P-value, which is equal to the power divided by
the sum of the power and the P-value. This definition is more
complete and relevant than Fisher's or Neyman-Pearson's
definitions, because it takes into account both concepts of
statistical significance. Using this definition, a statistically
significant finding requires a P-value of 0.05 or less when
the power is at least 95%, and a P-value of 0.032 or less
when the power is 60%. To achieve statistical significance,
P-values must be adjusted downward as the study power
decreases.

Key words: Statistical significance; Positive predictive 
value; Biostatistics; Clinical significance; Power 
Core tip: Statistical significance is currently defined
as a P-value of 0.05 or less; however, this definition is
inadequate because it ignores the effect of study power. A better
definition of statistical significance is based upon the
P-value's positive predictive value. To achieve statistical
significance using this definition, the power divided by the
sum of the power and the P-value must be 95% or greater.
INTRODUCTION

Scientific research has long accepted that a
research finding is statistically significant if P
< 0.05. In other words, the result could be attributed
to luck less than 1 in 20 times. If we are testing, for
example, the effects of drug A on effect B, we could stratify
groups into those receiving therapy, those taking
placebo, and those receiving no pharmacological intervention. If the
data resulted in a P-value less than 0.05, under the
generally accepted definition, this would suggest that
our results are statistically significant. However, it could
equally be argued that had it resulted in a P-value of
0.06, or just above the generally accepted cutoff of 0.05,
it is still statistically significant, but to a slightly lesser
degree: an index of statistical significance rather than
a dichotomous yes or no. In that case, further testing
may be indicated to validate the results, but there is perhaps
not enough evidence to conclude outright that the null
hypothesis, that drug A has no effect, is accurate.

The originator of this idea of a statistical threshold
was the famous statistician R. A. Fisher, who in his book
Statistical Methods for Research Workers first proposed
hypothesis testing using an analysis of variance P
value [1]. In his words, the importance of statistical significance
in biological investigation is to "prevent us being
deceived by accidental occurrences" which are "not
the causes we wish to study, or are trying to detect,
but a combination of the many other circumstances
which we can not control" [2]. His argument was that P
≤ 0.05 was a convenient level of standardization to
hold researchers to, but that, as an arbitrary number, it is
not a definitive rule. It is ultimately the responsibility
of the investigator to evaluate the significance of their
obtained data and P-value. For example, in some cases,
a P-value of 0.05 may indicate that further investigation is
warranted, while in others it may suffice.

There were, however, opposing viewpoints to this
idea, namely that of Neyman J and Pearson ES, who
argued for "hypothesis testing" rather than
"significance testing" as Fisher had postulated [3]. Neyman
and Pearson [4] raised the question that Fisher failed
to, namely that with data interpretation there may be
not only a type I error, but also a type II error (accepting
the null hypothesis when it should in fact be rejected).
They famously stated "Without hoping to know whether
each separate hypothesis is true or false, we may
search for rules to govern our behavior with regard to
them, in following which we insure that, in the long run
of experience, we shall not be too often wrong" [4]. Part
of the Neyman-Pearson approach includes researchers
assigning, prior to an experiment, the alternative hypothesis,
which should be specific, such as that drug X has Y
effect by 30% [5]. This hypothesis is later accepted or
rejected based on the P-value, whose threshold was
arbitrarily set at 0.05.

These two viewpoints, that of Neyman and Pearson
and the more subjective view of Fisher, were heavily
debated and are now recognized as the
Neyman-Pearson approach and the Fisher approach. In
today's academic setting, the determination of statistical
significance with a P-value has become truly dichotomous,
either rejection or acceptance based on P < 0.05,
rather than more of an index of suspicion as Fisher had
originally proposed. However, an approach of graded confidence
based on the P-value could be more beneficial than a
definitive decision based on an arbitrary cutoff.

The meaning and use of statistical significance as
originally defined by Fisher RA, Jerzy Neyman and Egon
Pearson have undergone little change in the almost 100
years since they were first proposed. Statistical significance
as originally proposed with Fisher's P-value was the
determination of whether or not a finding was unusual
and worthy of further investigation. The Neyman-Pearson
proposal was similar but slightly different. They proposed
the concepts of alpha and beta, with the alpha level
representing the chance of erroneously thinking there is
a significant finding (a Type I error) and the beta level
representing the chance of erroneously thinking there
is no significant finding (a Type II error) in the data
observed [6].


CLASSICAL STATISTICAL SIGNIFICANCE 

Statistical significance as currently used represents the 
chance that the null hypothesis is not true as defined 
by the P-value. The classic definition of a statistically 
significant result is when the P-value is less than or equal 
to 0.05, meaning that there is at most a one in twenty
chance that the test statistic found is due to normal
variation under the null hypothesis [2]. So when researchers
state that their findings are "statistically significant" what 
they mean is that if in reality there was no difference 
between the groups studied, their findings would 
randomly occur at most only once out of twenty trials. 

For example, consider an experiment in which 
there is no true difference between a placebo and an 
experimental drug. Because of normal random variation, 
a frequency distribution graph representing the difference
between subjects taking a placebo compared
with those taking the experimental drug typically forms
a bell shaped curve [7]. When there is no true difference
between the placebo and the experimental drug, small 
differences will occur frequently and cluster around 
zero, the center of the peak of the curve. Relatively 
large differences will also occur, albeit infrequently, and 
these results are represented by the upper and lower 
tails of the graph. Assuming the entire area under the 
bell shaped curve equals 1, as represented in Figure 1,
the findings are assumed to be statistically significant
when the difference found falls in either the lower or
upper 2.5% of the frequency distribution [8].

Figure 1 According to the classical definition, research findings are considered statistically significant when the difference observed falls in the upper or
lower tails of the frequency distribution, represented above in black.

Figure 2 If the observed difference is greater than x, then we consider that the finding is statistically significant and the null hypothesis is rejected. If the
difference found is less than x, then we accept the null hypothesis and reject the alternative hypothesis. The area in black represents a Type I error which occurs when
the difference is greater than x, but the null hypothesis is in fact true. The lined area represents a Type II error which occurs when the difference found is less than x,
but the alternative hypothesis is in fact true.
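
The lower and upper 2.5% tails described above can be located numerically. The following is a minimal sketch, assuming the differences follow a standard normal distribution under the null hypothesis; the article only describes a bell shaped curve, so the normal model and the function name `two_tailed_cutoffs` are illustrative assumptions.

```python
from statistics import NormalDist

def two_tailed_cutoffs(alpha=0.05):
    """Return (lower, upper) cutoffs that leave alpha/2 in each tail.

    Assumption: the null distribution is standard normal (mean 0, SD 1).
    """
    null = NormalDist(mu=0.0, sigma=1.0)
    lower = null.inv_cdf(alpha / 2)        # 2.5% of the area lies below this value
    upper = null.inv_cdf(1 - alpha / 2)    # 2.5% of the area lies above this value
    return lower, upper

lower, upper = two_tailed_cutoffs()
print(round(lower, 2), round(upper, 2))  # -1.96 1.96
```

Under this assumed model, a difference is classically significant when it lies more than about 1.96 standard errors from zero in either direction.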

Note that the classical definition of statistical
significance according to Fisher relies only upon a single
frequency distribution curve, representing the null
hypothesis that no true difference exists between the
two groups observed [9]. Fisher's approach makes the
primary assumption that only one group exists, as
represented by a single frequency distribution curve,
and P-values (the likelihood of a large difference being
observed) define statistical significance. The Neyman-Pearson
approach is slightly different, in that the primary
assumption is that two groups exist, and two frequency
distributions are necessary [10]. In this approach, the
tail of the frequency distribution representing the null
hypothesis (no difference) is represented by alpha (α).
Similar to the P-value, alpha represents the chance of
rejecting the null hypothesis when in fact it is true, a
Type I error [11]. The tail of the frequency distribution
representing the alternative hypothesis (a true difference
exists) is represented by beta (β). Beta represents the
chance of rejecting the alternative hypothesis when in
fact it is true, a Type II error. If we are doing a one-tailed
comparison, e.g., when we assume the experimental
drug will improve but not hurt patients, alpha and
beta can be visualized in Figure 2. The area in black
represents a Type I error and the lined area represents a
Type II error.
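
The relationship between alpha, beta, and the cutoff x of Figure 2 can be sketched numerically. This is only an illustration, assuming both frequency distributions are normal with a standard deviation of 1; the parameter `effect` (the true difference in standard-error units) and the function name `one_tailed_beta` are hypothetical and do not appear in the article.

```python
from statistics import NormalDist

def one_tailed_beta(effect, alpha=0.05):
    """Type II error rate (beta) for a one-tailed comparison.

    Assumptions: the null curve is Normal(0, 1) and the alternative
    curve is Normal(effect, 1), where `effect` is a hypothetical true
    difference expressed in standard-error units.
    """
    null = NormalDist(0.0, 1.0)
    alternative = NormalDist(effect, 1.0)
    cutoff = null.inv_cdf(1 - alpha)   # the x of Figure 2: alpha of the null curve lies above it
    return alternative.cdf(cutoff)     # share of the alternative curve that falls below x

beta = one_tailed_beta(effect=2.5)
power = 1 - beta
print(f"beta = {beta:.2f}, power = {power:.2f}")
```

With no true difference (`effect=0`) the two curves coincide and beta equals 1 - alpha; as the true difference grows, beta shrinks and power rises.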


A NEW DEFINITION OF STATISTICAL SIGNIFICANCE

It is time that statistical significance be defined not
just as the chance that the null hypothesis is not true (a
low P-value), or the likelihood of error when rejecting
(α) or accepting (β) the null hypothesis. While these
statistics help us evaluate research data, they do not
give us the odds of being right or wrong, which requires
that we analyze the P-value and β together [12].

While it is helpful to visualize the concepts of
alpha and beta on frequency distribution graphs, it is
additionally illuminating to compare these concepts with
sensitivity, specificity, and predictive values obtained
from 2x2 contingency tables. In Table 1, the rows
represent our statistical test results, and the columns
represent what is actually true. Row 1 represents the
situation when our data analysis results in a P-value of
≤ 0.05, and row 2 represents the situation when our
analysis results in a P-value of > 0.05. The columns
represent reality. Column 1 represents the situation
when the alternative hypothesis is in reality true,
and column 2 represents the situation when the null
hypothesis in reality is true.

Table 1 Statistically significant research findings can represent
a true positive or false positive

                                 Reality
Study findings                   Alternative hypothesis true   Null hypothesis true
Significant P-value ≤ 0.05       True positive                 False positive
Insignificant P-value > 0.05     False negative                True negative

Similarly, statistically insignificant findings may represent a true or false
negative.

Table 2 When the P-value is utilized to determine whether or
not a finding is statistically significant, 1 - beta represents the
sensitivity for identifying the alternative hypothesis, and 1 - alpha
represents the specificity

                                 Reality
Study findings                   Alternative hypothesis true   Null hypothesis true
Significant P-value ≤ 0.05       1 - beta (power)              Alpha (exact P-value)
Insignificant P-value > 0.05     Beta                          1 - alpha

Table 3 A Type I error corresponds to 1 - specificity and a
Type II error corresponds to 1 - sensitivity when study findings
are determined to be significant or insignificant based upon the
P-value

                                 Reality
Study findings                   Alternative hypothesis true   Null hypothesis true
Significant P-value ≤ 0.05       Correct                       Type I error
Insignificant P-value > 0.05     Type II error                 Correct

Table 4 This 2x2 contingency table shows the corresponding
values for a research study where a study finding is determined to
be significant based upon a P-value of 0.05 and when the study's
power is 80%

                                 Reality
Study findings                   Alternative hypothesis true   Null hypothesis true
Significant P-value ≤ 0.05       0.8                           0.05
Insignificant P-value > 0.05     0.2                           0.95

In Table 1, row 1 column 1 are the true positives 
because the P-value is ≤ 0.05 and the alternative
hypothesis is true. Row 1 column 2 are false positives,
because even though the P-value is ≤ 0.05, the reality
is that there is no significant difference and the null 
hypothesis is true. Similarly, row 2 column 1 are the 
false negatives because the P-value is insignificant (P 
> 0.05) but in reality the alternative hypothesis is true. 
Row 2 column 2 are the true negatives because the 
P-value is insignificant and the null hypothesis is true. 

Table 2 shows our findings in terms of alpha and 
beta. In this case, alpha represents the exact P-value, 
not just whether or not the P-value is ≤ 0.05. Beta is
not only the chance of a Type II error (a false negative);
it is also used to determine the study's power, which is simply
equal to 1 - beta. Table 3 shows the same information
in another way, showing the situations in which our test 
of statistical significance, the P-value, is in fact correct 
or is in error. 

When we know beta and alpha, or alternatively
the P-value and power of the study, we can fill out
the contingency table and answer our real question of
how likely it is that our findings represent the truth.
Statistical power, equal to 1 - beta, is typically set
in advance to help determine sample size. A typical
level recommended for power is 0.80 [13]. Table 4 is an
example 2x2 contingency table in which the study
has a power of 0.80 and the analysis finds a statistically
significant result of P = 0.05. In this situation, the
sensitivity of the test statistic equals the power, or 
0.8/(0.8 + 0.2). The specificity of the test statistic is 
1 minus alpha, or 0.95/(0.05 + 0.95). Our positive 
predictive value is power divided by the sum of power 
and the exact P-value, or 0.80/(0.80 + 0.05). The 
negative predictive value is the specificity divided by the 
sum of the specificity and beta, or 0.95/(0.95 + 0.20). 
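
The Table 4 arithmetic can be checked with a few lines of code. This is a minimal sketch of the calculations described in the paragraph above; the function name `predictive_values` is illustrative and not from the article.

```python
def predictive_values(power, p_value):
    """Positive and negative predictive values as defined in the text.

    PPV = power / (power + P-value)
    NPV = specificity / (specificity + beta), with specificity = 1 - P-value
    """
    beta = 1 - power
    specificity = 1 - p_value
    ppv = power / (power + p_value)
    npv = specificity / (specificity + beta)
    return ppv, npv

ppv, npv = predictive_values(power=0.80, p_value=0.05)
print(f"PPV = {ppv:.3f}, NPV = {npv:.3f}")  # PPV = 0.941, NPV = 0.826
```

At a power of 0.80, a P-value of exactly 0.05 therefore gives a positive predictive value of about 94%, just short of the 95% level.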

To be 95% confident that the P-value represents 
a statistically significant result, the positive predictive 
value must be 95% or greater. In the standard situation
where the study power is 0.80, a P-value of 0.042 or less
is required to achieve this level of confidence. As shown
in Table 5, a power of 0.95 is required for a P-value of
0.05 to indicate a 95% or greater confidence that the
study's findings are statistically significant. If the power
falls to 90%, a P-value of 0.047 or less is required to be
95% confident that the alternative hypothesis is true
(i.e., a 95% positive predictive value). If the power is
only 60%, then a P-value of 0.032 or less is required to 
be 95% confident that the alternative hypothesis is true. 
To determine how likely a study's findings represent the 
truth, determine the positive predictive value (PPV) of 
the test statistic: 

PPV = power/(power + P-value) 

To determine the required P-value to achieve a 95%
PPV:

P-value = (power - 0.95 * power)/0.95

Table 5 P-values corrected for study power

Study power   P-value
0.95          0.05
0.9           0.047
0.85          0.045
0.8           0.042
0.75          0.039
0.7           0.037
0.65          0.034
0.6           0.032
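
The cutoff formula follows from setting PPV = power/(power + P-value) equal to 0.95 and solving for the P-value, which gives power × (1 - 0.95)/0.95. The sketch below reproduces the Table 5 cutoffs; the function name `required_p_value` and the `target_ppv` parameter are illustrative assumptions rather than notation from the article.

```python
def required_p_value(power, target_ppv=0.95):
    """Largest P-value whose positive predictive value reaches target_ppv.

    Solves target_ppv = power / (power + P) for P.
    """
    return power * (1 - target_ppv) / target_ppv

for power in (0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60):
    print(f"{power:.2f}  {required_p_value(power):.3f}")
```

Changing `target_ppv` shows how the cutoff would move if a confidence level other than 95% were chosen.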

In the situation where the P-value is greater than 
the cutoff values determined by the preceding method, 
it is helpful to determine just how confident we can be 
that the null hypothesis is correct. This simply entails 
calculating the negative predictive value of the test 
statistic: 

NPV = (1 - alpha)/(1 + beta - alpha)

Finally, using this method we can determine the 
overall accuracy of a research study. Prior to collecting 
and analyzing the research data, pre-set values are 
determined for power and a cutoff P-value for statistical 
significance. If we want to be 95% confident that a 
research study will correctly identify reality, a pre-set 
power of 95% along with a pre-set cutoff P-value of 0.05 
is required. At a pre-set power of 90%, a pre-set cutoff 
P-value of 0.01 is required. When the pre-set power is
80% or less, the confidence in the accuracy
of the study findings is at most 90%, even when the
pre-set P-value cutoff is extremely low. To determine
the maximum level of confidence a study can have at 
a specific level of power and cutoff P-value (alpha), 
calculate the accuracy: 

Accuracy = (1 + power - alpha)/2 
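
The accuracy formula corresponds to adding the two correct cells of Table 2 (power and 1 - alpha) and dividing by the table total of 2, since each column sums to 1. A minimal sketch, with `accuracy` as an illustrative function name:

```python
def accuracy(power, alpha):
    """Overall accuracy as defined in the text: (1 + power - alpha) / 2."""
    return (1 + power - alpha) / 2

# Pre-set power of 95% with a cutoff P-value of 0.05
print(f"{accuracy(0.95, 0.05):.3f}")  # 0.950

# At 80% power, even a cutoff of zero cannot exceed 90% accuracy
print(f"{accuracy(0.80, 0.0):.3f}")   # 0.900
```

The second call illustrates the ceiling mentioned above: with power fixed at 0.80, lowering the cutoff P-value can raise the accuracy only as far as 0.90.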


CONCLUSION

Statistical significance has for too long been broadly
defined as a P-value of 0.05 or less [14]. Using the P-value
alone can be misleading because its calculation does not
take into account the effect of study power upon the
likelihood that the P-value represents normal variation
or a true difference in study populations [15]. If we want
to be at least 95% confident that a research study has 
identified a true difference in study populations, the 
power must be at least 95%. If the power is lower, the 
required P-value to indicate a statistically significant 
result needs to be adjusted downward according to 
the formula P-value = (power - 0.95*power)/0.95. 
Furthermore, by using the positive predictive value of
the P-value, not just the P-value alone, researchers
and readers are able to better understand the level of 
confidence they can have in the findings and better 
assess clinical relevance [16]. Only when the power of
a study is at least 95% does a P-value of 0.05 or less 
indicate a statistically significant result. 